Exploratory analysis of spatiotemporal data

Types of spatiotemporal data

In space, we can have:

  • Lattice data: observed on a finite grid, like counties or ZIP codes (e.g., number of COVID cases in each NYC ZIP)
  • Geostatistical data: observed at continuous spatial locations (e.g., temperature or wind speed data)
  • Point process data: locations of observations are themselves random (e.g., bird nest locations across NYC)

In time, we can have:

  • Deterministic time points that are regularly or irregularly spaced
  • Random time points

Materials needed for studying spatiotemporal data

  • Spatiotemporal measurements: often in tabular form, e.g. one row per location and one column per time point, or vice versa
  • Spatial shapefile: collection of files describing the geometry of a space in the form of polygons or lines. Must include .shp, .shx, and .dbf files.
  • R packages: useful packages include rgdal (includes the readOGR() function for reading .shp files), spdep, maptools, and ggplot2
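Note: rgdal and maptools were retired from CRAN in October 2023. If they are unavailable on your system, the sf package provides equivalent functionality; a minimal sketch (assuming the same shapefile path used later in this analysis):

```r
# Read a shapefile with sf::st_read instead of rgdal::readOGR (alternative route,
# not used in the rest of this analysis)
library(sf)
NYC_SHP_sf <- st_read("Shapefiles/ZIP_CODE_040114.shp")
```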

Example dataset: COVID-19 hospitalizations in NYC, March 2020

In this analysis, we will study a dataset that records the number of people from each NYC ZIP code who were hospitalized for (potential) COVID-19 each day in March 2020. This data comes from the NYC Department of Health and Mental Hygiene; my colleagues and I study it in this research paper.

# Read the COVID hospitalization data from a CSV, specifying that the first column
# contains the "row names" for this dataset (these are the ZIP codes). The 
# 'check.names' option tells R to not modify the column names, which are dates.
NYC_COVID <- read.csv("Data/ZipcodeHospitalizations.csv", row.names=1, check.names=F)

# Save the dates for future use
dates <- as.Date(colnames(NYC_COVID))

# Print first 5 rows and first 7 columns of the data
head(NYC_COVID, n=c(5,7))
##       2020-03-01 2020-03-02 2020-03-03 2020-03-04 2020-03-05 2020-03-06
## 10001         10         11          6         11          8          5
## 10002          9         19         23         17         15         18
## 10003          3          8          9          6          7          9
## 10004          0          0          3          0          2          1
## 10005          0          0          0          4          0          2
##       2020-03-07
## 10001          7
## 10002         21
## 10003          7
## 10004          0
## 10005          0

Next we read in the shapefile that defines ZIP code boundaries for NYC. The shapefile can be downloaded from NYC Open Data at this link.

# Load the rgdal package, which contains the readOGR function for reading shapefiles
library(rgdal)

# Read NYC shapefile
NYC_SHP <- readOGR("Shapefiles/ZIP_CODE_040114.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/saravenkatraman/Library/Mobile Documents/com~apple~CloudDocs/Documents/Cornell University (PhD)/Presentations/Spatial Modeling/Shapefiles/ZIP_CODE_040114.shp", layer: "ZIP_CODE_040114"
## with 263 features
## It has 12 fields

It looks like the shapefile has data about 263 ZIP codes in NYC whereas the COVID dataset has only 173, a few of which are actually outside the city. Before we proceed, we’ll restrict both the shapefile and the COVID dataset to include the same ZIPs.

# Find which ZIP codes are in both the NYC shapefile and the COVID dataset.
# Recall that the row names of the COVID dataset are ZIPs.
zips <- intersect(NYC_SHP$ZIPCODE, rownames(NYC_COVID))

# Cut down the shapefile to just these ZIPs
NYC_SHP <- NYC_SHP[NYC_SHP$ZIPCODE %in% zips, ]

# Cut down the COVID dataset to just these ZIPs. Note that this also reorders
# the rows to match the order of the ZIPs in the shapefile.
NYC_COVID <- NYC_COVID[zips, ]

We are left with 169 ZIP codes, but the shapefile still contains 178 rows. This is because there is duplicate information for a few of the ZIPs. We’ll remove these duplicates:

NYC_SHP <- NYC_SHP[! duplicated(NYC_SHP$ZIPCODE), ]

It turns out that this shapefile also contains the populations of each ZIP code, which we’ll need for our analyses. We’ll save this information in its own dataframe.

# Save populations of each ZIP code into a dataframe 
NYC_Populations <- data.frame(pop = NYC_SHP$POPULATION, row.names = zips)

# Print out the first few rows
head(NYC_Populations)
##          pop
## 11213  62426
## 11212  83866
## 11225  56527
## 11218  72280
## 11226 106132
## 11219  92561

The ZIP code 10162 is missing population data, so we will fill it in using 2020 Census data.

NYC_Populations[which(rownames(NYC_Populations) == "10162"), "pop"] <- 1814

Choropleth mapping

The ggplot2 package can be used to visualize spatial data on maps, and the plotly package can be used to make the plot interactive, i.e. to display information upon hovering or clicking on the plot. First we’ll draw a map of the total number of hospitalizations from each ZIP code.

Note: We should probably plot the hospitalization rate instead, e.g. the fraction of each ZIP’s population that was hospitalized, but we’ll stick to raw counts just for demonstration. To get the rate, we would divide rowSums(NYC_COVID) by NYC_Populations$pop in the code below.
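For concreteness, the rate computation would look like the following sketch, here run on made-up toy objects standing in for NYC_COVID and NYC_Populations (all numbers hypothetical):

```r
# Toy stand-ins for the count matrix and population vector (hypothetical numbers)
counts <- matrix(c(10, 5, 12, 7), nrow = 2,
                 dimnames = list(c("10001", "10002"), c("2020-03-01", "2020-03-02")))
pops <- c(20000, 80000)

# Fraction of each ZIP's population hospitalized over the period
rates <- rowSums(counts) / pops
```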

# Load ggplot2 for plotting, plotly for interactive plotting, viridis for 
# color palettes, and maptools for handling spatial objects
library(ggplot2)
library(plotly)
library(viridis)
library(maptools)

# Compute cumulative hospitalization counts
COVID_cumulative <- data.frame(ZIP = zips, Count = rowSums(NYC_COVID))

# Convert spatial polygons from the shapefile into a dataframe that gives the 
# latitude and longitude for each point defining the ZIP code boundaries
# plotData <- fortify(NYC_SHP, region = "ZIPCODE")
plotData <- broom::tidy(NYC_SHP) 

# Add an ID column to cumulative COVID dataset to enable merging with plotData
COVID_cumulative$ID <- unique(plotData$id)

# Merge hospitalization counts with map data. Note that in the plotData dataframe,
# the ZIP codes are in the 'id' column.  
plotData <- merge(plotData, COVID_cumulative, by.x = "id", by.y = "ID")

# Construct map: ZIP (the ZIP code) is the text shown when hovering over the plot
map1 <- ggplot(plotData, aes(x = long, y = lat, text = ZIP)) +
  # Draw polygons colored according to the hosp. count, with a black border
  geom_polygon(aes(group = group, fill = Count), color = "black", size = 0.2) +
  # Specify purple color palette, with darker shades indicating higher counts
  scale_fill_distiller(palette = "Purples", direction = 1) +
  # Set plot title
  ggtitle("Total COVID-related hospitalizations, March 2020") +
  # Remove background and axes, and center the plot title
  theme_void() + theme(plot.title = element_text(hjust = 0.5))            

# Draw the interactive map
ggplotly(map1)

To visualize the temporal dynamics of disease spread, we can look at the cumulative hospitalizations at several intermediate points in time. Below, we plot the cumulative hospitalization counts after each week in March 2020, i.e. after 8, 15, 23, and 30 days.

# Compute cumulative hospitalization counts after 8, 15, 23, 30 days
COVID_cumulative <- data.frame(ZIP = zips, sapply(c(8, 15, 23, 30), function(x) rowSums(NYC_COVID[, 1:x])))
colnames(COVID_cumulative)[-1] <- c("Week 1", "Week 2", "Week 3", "Week 4")

# Convert spatial polygons from the shapefile into a dataframe that gives the 
# latitude and longitude for each point defining the ZIP code boundaries
plotData <- broom::tidy(NYC_SHP)

# Add an ID column to cumulative COVID dataset to enable merging with plotData
COVID_cumulative$ID <- unique(plotData$id)

# Merge hospitalization counts with map data. Note that in the plotData dataframe,
# the ZIP codes are in the 'id' column. 
plotData <- merge(plotData, reshape2::melt(COVID_cumulative), by.x = "id", by.y = "ID")

# Rename the "variable" and "value" columns of plotData to "Week" and "Count"
colnames(plotData)[which(colnames(plotData) == "variable")] <- "Week"
colnames(plotData)[which(colnames(plotData) == "value")] <- "Count"

# Construct map: ZIP (the ZIP code) is the text shown when hovering over the plot
map2 <- ggplot(plotData, aes(x = long, y = lat, text = ZIP)) +
  # Draw polygons colored according to the hosp. count, with a black border
  geom_polygon(aes(group = group, fill = Count), color = "black", size = 0.2) +
  # Put each week's counts on a separate panel, arranged in a 2x2 layout
  facet_wrap(~ Week, ncol = 2) +
  # Specify purple color palette, with darker shades indicating higher counts
  scale_fill_distiller(palette = "Purples", direction = 1) +
  # Set plot title
  ggtitle("Cumulative COVID-related hospitalizations, March 2020\n") + 
  # Remove background and axes, and center the plot title
  theme_void() + theme(plot.title = element_text(hjust = 0.5))

# Draw the map statically; uncomment the ggplotly call below for an interactive version
map2

# ggplotly(map2)

The above maps help visualize the temporal evolution of spatially-localized outbreaks. In the Jackson Heights/Corona area of Queens, for instance, a severe outbreak seems to have been relatively contained within the neighborhood and the ZIP codes immediately surrounding it. By contrast, outbreaks originating in the Melrose area of the Bronx and East Flatbush in Brooklyn seem to have spread to neighboring ZIP codes somewhat more evenly.

Temporal plots

To visualize city-wide variation in hospitalizations over time, we will plot two time series side-by-side:

  1. The time series of daily total hospitalizations
  2. The time series of cumulative hospitalizations

# Create a dataframe comprised of the dates in March, the hospitalization counts each
# day, and the cumulative hospitalization counts each day. Note that we obtain
# the daily count by taking the sum of each column in NYC_COVID.
plotData <- data.frame(date = dates, count = colSums(NYC_COVID), cumu_count = cumsum(colSums(NYC_COVID)))

# Construct first plot: x-axis is time, y-axis is hospitalizations
p1 <- ggplot(plotData, aes(x = date, y = count)) +
  # Draw points and lines, set axis labels
  geom_point() + geom_line() + labs(x = "Date", y = "Hospitalization count") +
  # Set plot title
  ggtitle("COVID-19 hospitalizations in NYC, March 2020") +
  # Center the plot title
  theme(plot.title = element_text(hjust = 0.5))
  
# Construct second plot: x-axis is time, y-axis is cumulative hospitalizations
p2 <- ggplot(plotData, aes(x = date, y = cumu_count)) +
  # Draw points and lines, set axis labels
  geom_point() + geom_line() + labs(x = "Date", y = "Cumulative hospitalization count") +
  # Set plot title
  ggtitle("Cumulative COVID-19 hosp. in NYC, March 2020") +
  # Center the plot title
  theme(plot.title = element_text(hjust = 0.5))
  
# Load package for arranging ggplots on a grid
library(gridExtra)
grid.arrange(p1, p2, ncol=2)

  • The left time series plot shows us that daily hospitalization counts increased until approximately the third week of March. Note that there appears to be some seasonality in the counts, i.e. counts appear to drop around the end of each week and rise back up at the beginning of the next. We later found that this was due to certain data collection practices by which weekend hospitalizations were not recorded until the following Monday. This can be corrected via smoothing if needed.
  • The right time series plot suggests that the hospitalizations seem to have increased more or less linearly over time. It appears that around March 9, the hospitalization count began to grow a bit faster than it had the week before.
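As noted above, the weekend reporting artifact could be smoothed out, e.g. with a centered 7-day moving average. A minimal sketch on hypothetical daily counts (stats::filter is base R):

```r
# Hypothetical daily hospitalization counts with weekend dips
daily <- c(5, 6, 7, 3, 2, 8, 9, 10, 4, 3)

# Centered 7-day moving average; the edges are NA where a full window isn't available
smoothed <- stats::filter(daily, rep(1/7, 7), sides = 2)
```

In practice we would apply this to colSums(NYC_COVID) before plotting.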

Spatial correlation

We will next review ways to assess the presence of spatial autocorrelation, or spatial dependence, in our dataset. That is, how similar are ZIP code hospitalization counts to those in neighboring ZIPs?

Moran’s \(I\)-statistic is a measure of spatial dependence that we can use to investigate this question. If \(x_1,...,x_N\) are measurements (e.g., COVID case counts) at \(N\) spatial locations (e.g., ZIP codes), Moran’s \(I\)-statistic is defined as:

\[\text{Moran's } I \;\;=\;\; \frac{\sum_{i=1}^N \text{covariance between location } i \text{ and its neighbors}}{\text{variance over all locations}} \;\;=\;\; \frac{N\sum_{i=1}^N \sum_{j=1}^N w_{ij}(x_i-\bar x)(x_j-\bar x)}{W\sum_{i=1}^N (x_i-\bar x)^2}\]

where:

  • \(\bar x = \frac{1}{N}\sum_{i=1}^N x_i\), the overall average (i.e. average hospitalization count)
  • \(w_{ij}\) are weights between spatial locations. Typically \(w_{ij} = 1\) if locations \(i\) and \(j\) are neighbors and 0 otherwise
  • \(W\) is the sum of all the \(w_{ij}\)
  • \(N\) is the total number of locations
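As a sanity check on the formula, here is a minimal base-R implementation on toy data (in practice we use spdep's moran.test, as below):

```r
# Moran's I computed directly from the formula above
morans_I <- function(x, w) {
  N <- length(x)
  xc <- x - mean(x)                  # deviations from the overall average
  num <- N * sum(w * outer(xc, xc))  # weighted cross-products of deviations
  den <- sum(w) * sum(xc^2)          # W times the total sum of squares
  num / den
}

# Toy example: 4 locations in a row, each location neighboring the next
w <- matrix(0, 4, 4)
w[cbind(c(1, 2, 2, 3, 3, 4), c(2, 1, 3, 2, 4, 3))] <- 1
morans_I(c(1, 2, 3, 4), w)  # 1/3: neighboring values tend to be similar
```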

The Moran’s \(I\)-statistic is between \(-1\) (negative correlation) and \(1\) (positive correlation), and we can test for its statistical significance. Below, we compute the Moran’s \(I\) associated with the total hospitalization count in each ZIP code. First, we need to identify each ZIP’s set of neighbors:

# Load spdep package for the poly2nb function, which will identify neighbors
library(spdep)

# Construct a neighbors list that identifies each ZIP code's set of neighbors
NYC_neighbors <- poly2nb(NYC_SHP)

We’ll print out the first few entries of the NYC_neighbors list:

head(NYC_neighbors)
## [[1]]
## [1]   2   3 112 155 157
## 
## [[2]]
## [1]   1 102 112 114 157
## 
## [[3]]
## [1]   1   5 105 110 112 155
## 
## [[4]]
## [1]   5   6   8   9 110 111
## 
## [[5]]
## [1]   3   4   7   8 110 112
## 
## [[6]]
## [1]   4   9 111 115 123

This tells us that ZIP code #1 has ZIPs 2, 3, 112, 155, and 157 as its neighbors, for instance. We can check which ZIP codes these correspond to:

zips[c(1, NYC_neighbors[[1]])]
## [1] "11213" "11212" "11225" "11203" "11216" "11233"

ZIP code #1 is 11213 (Crown Heights, Brooklyn), and the other ZIP codes immediately surround it.

Further inspection of the NYC_neighbors list reveals that several ZIPs are not assigned any neighbors, which would produce errors in subsequent calculations. We resolve this by manually assigning their neighbors. First we check which ZIPs have no neighbors recorded:

zips_without_neighbors <- which(sapply(NYC_neighbors, sum) == 0)
zips[zips_without_neighbors]
## [1] "10044" "11222" "11237" "11385" "10162"

We’ll manually set these neighbors:

NYC_neighbors[[zips_without_neighbors[1]]] <- which(zips %in% c("10022", "10065", "10021", "11101"))
NYC_neighbors[[zips_without_neighbors[2]]] <- which(zips %in% c("11101", "11211", "11378")) 
NYC_neighbors[[zips_without_neighbors[3]]] <- which(zips %in% c("11385", "11211", "11206", "11221", "11207"))
NYC_neighbors[[zips_without_neighbors[4]]] <- which(zips %in% c("11379", "11378", "11237", "11207", "11208", "11421", "11375"))
NYC_neighbors[[zips_without_neighbors[5]]] <- which(zips %in% c("10021", "10075"))

# Compute spatial weights: for each ZIP, weights are 1/N, where N = number of neighbors
NYC_neighbors <- nb2listw(NYC_neighbors, zero.policy = TRUE)

# Compute global Moran's I-statistic for total case counts
COVID_moran <- moran.test(rowSums(NYC_COVID), NYC_neighbors, zero.policy = TRUE)

# Compute global Moran's I statistic for cumulative case counts up to each day in March 2020
cumulative_moran <- data.frame(date = dates[-1], statistic = 0, pvalue = 0)
for(i in 2:30) {
  COVID_moran_day <- moran.test(rowSums(NYC_COVID[, 1:i]), NYC_neighbors, zero.policy = TRUE)
  cumulative_moran[i-1, "statistic"] <- COVID_moran_day$estimate[1]
  cumulative_moran[i-1, "pvalue"] <- COVID_moran_day$p.value
}

ggplot(cumulative_moran, aes(x = date, y = statistic)) + geom_point() + geom_line() +
  labs(x="Date", y="Moran's I statistic", title="Moran's I statistic for cumulative hospitalizations up to each day") + theme(plot.title=element_text(hjust=0.5))

# Local Moran's I statistic
cumulative_local_moran_stat <- data.frame(matrix(0, nrow = length(zips), ncol = length(dates)-1))
colnames(cumulative_local_moran_stat) <- dates[-1]
rownames(cumulative_local_moran_stat) <- zips
cumulative_local_moran_pvalue <- cumulative_local_moran_stat
for(i in 2:30) {
  COVID_moran_day <- localmoran(rowSums(NYC_COVID[, 1:i]) / NYC_Populations$pop, NYC_neighbors, zero.policy = TRUE)
  cumulative_local_moran_stat[, i-1] <- COVID_moran_day[,1]
  cumulative_local_moran_pvalue[, i-1] <- COVID_moran_day[,5]
}

# Plot local Moran's statistics for 11368 (Jackson Heights/Corona), 11203 (East Flatbush), 10002 (LES), 10451 (Melrose)
plotData <- cumulative_local_moran_stat[c("10002", "11203", "11368", "10451"),]
rownames(plotData) <- c("10002 (Lower East Side, Manhattan)", "11203 (East Flatbush, Brooklyn)", "11368 (Jackson Heights/Corona, Queens)", "10451 (Melrose, Bronx)")
plotData <- reshape2::melt(t(plotData))
colnames(plotData) <- c("Date", "zip", "I")
p1 <- ggplot(plotData, aes(x = as.Date(Date), y = I)) + geom_point() + geom_line() + facet_wrap(~ zip, nrow=1) +
  labs(x = "Date", y = "Local Moran's I statistic", title = "Local Moran's I statistic for cumulative hospitalizations up to each day in select ZIP codes") + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, hjust = 1))

# Plot the corresponding p-values
plotData <- cumulative_local_moran_pvalue[c("10002", "11203", "11368", "10451"),]
rownames(plotData) <- c("10002 (Lower East Side, Manhattan)", "11203 (East Flatbush, Brooklyn)", "11368 (Jackson Heights/Corona, Queens)", "10451 (Melrose, Bronx)")
plotData <- reshape2::melt(t(plotData))
colnames(plotData) <- c("Date", "zip", "p")
p2 <- ggplot(plotData, aes(x = as.Date(Date), y = p)) + geom_point() + geom_line() + facet_wrap(~ zip, nrow = 1) +
  labs(x = "Date", y = "Local Moran's I p-value", title = "Local Moran's I p-value") +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "blue") + theme(plot.title=element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p1, p2, nrow = 2)

We can also identify the spatial clustering pattern of each area and plot it on a choropleth.

COVID_cumulative_LISA <- attr(localmoran(rowSums(NYC_COVID) / NYC_Populations$pop, NYC_neighbors, zero.policy = TRUE), "quadr")["mean"]
COVID_cumulative_LISA$zip <- zips
colnames(COVID_cumulative_LISA)[1] <- "LISA"

# Mark non-significant local statistics
COVID_cumulative_LISA$LISA <- as.character(COVID_cumulative_LISA$LISA)
COVID_cumulative_LISA$LISA[which(cumulative_local_moran_pvalue[, ncol(cumulative_local_moran_pvalue)] > 0.05)] <- "Not significant"

# As before, convert spatial polygons from the shapefile into a longitude-latitude 
# dataframe and merge the clustering results
plotData <- broom::tidy(NYC_SHP)
## Regions defined for each Polygons
COVID_cumulative_LISA$ID <- unique(plotData$id)
plotData <- merge(plotData, COVID_cumulative_LISA, by.x = "id", by.y = "ID")

# Draw the map
ggplot(plotData, aes(x = long, y = lat, text = id)) +
  geom_polygon(aes(group = group, fill = LISA), color = "black", size = 0.2) +
  ggtitle("Spatial clusters of COVID-19 hospitalizations, March 2020") +
  scale_fill_manual(values = c("indianred3", "lightpink", "lightskyblue3", "blue", "white")) +
  theme_void() + theme(plot.title = element_text(hjust = 0.5))

Modeling

Modeling spatial dynamics only

We begin by fitting a model of the total hospitalization counts in each ZIP code, so there is no temporal component just yet:

\[\log(\rho_i) = \alpha + \beta_i\]

where:

  • \(\rho_i\) is the fraction of ZIP code \(i\) that was hospitalized in March 2020,
  • \(\alpha\) is the average proportion of individuals hospitalized across NYC,
  • \(\beta_i\) is an effect specific to ZIP code \(i\).

Specifically, \(\beta_i = \beta_{1,i} + \beta_{2,i}\), where \(\beta_{1,i}\) is a spatially-structured residual and \(\beta_{2,i}\) is an unstructured residual. The unstructured residuals \(\beta_{2,i}\) are assumed to be i.i.d. \(N(0, \sigma^2_2)\). The structured residual \(\beta_{1,i}\) follows an intrinsic conditional autoregressive (ICAR) model, with conditional distribution:

\[\begin{gather*} \beta_{1,i}|\beta_{1,j\neq i} \sim N(m_i, s_i^2) \\ \text{where } m_i = \text{average of } \beta_{1,j} \text{ over neighboring indices } j \\ \text{and } s_i^2 = \sigma^2_1 \,/\, (\text{number of neighbors of location } i) \end{gather*}\]
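In the code below, this model is fit as a Poisson regression with each ZIP's population as an exposure, i.e.

\[ y_i \sim \text{Poisson}(P_i\,\rho_i), \qquad \log \mathbb{E}[y_i] = \log P_i + \alpha + \beta_i, \]

where \(y_i\) is the total hospitalization count and \(P_i\) the population of ZIP code \(i\). The \(\log P_i\) term is what we supply via offset = log(population).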

library(INLA)
## Loading required package: Matrix
## This is INLA_24.02.09 built 2024-02-09 03:43:24 UTC.
##  - See www.r-inla.org/contact-us for how to get help.
##  - List available models/likelihoods/etc with inla.list.models()
##  - Use inla.doc(<NAME>) to access documentation

# Gather the data needed for the model
modelData <- data.frame(zipID = 1:length(zips), count = rowSums(NYC_COVID), population = NYC_Populations$pop)

# Create list of each ZIP’s neighbors (this writes a file into the working directory)
nb2INLA("NYC.graph", poly2nb(NYC_SHP))

# Define the formula for this model
model0Formula <- count ~ 1 + f(zipID, model = "bym", graph = "NYC.graph")

# Fit the model using the INLA algorithm
spatialModel <- inla(model0Formula, family = "poisson", offset = log(population), data = modelData)

# Look at exponentiated coefficients to get city-wide rate
exp(spatialModel$summary.fixed)
##                   mean       sd 0.025quant   0.5quant 0.975quant       mode kld
## (Intercept) 0.01217149 1.030081  0.0114818 0.01217186 0.01290044 0.01217186   1

The \(f(\cdot)\) formulation allows us to specify the structure of the term inside it. The default option is iid, but we can see all the other possible options as follows:

names(inla.models()$latent)
##  [1] "linear"       "iid"          "mec"          "meb"          "rgeneric"    
##  [6] "cgeneric"     "rw1"          "rw2"          "crw2"         "seasonal"    
## [11] "besag"        "besag2"       "bym"          "bym2"         "besagproper" 
## [16] "besagproper2" "fgn"          "fgn2"         "ar1"          "ar1c"        
## [21] "ar"           "ou"           "intslope"     "generic"      "generic0"    
## [26] "generic1"     "generic2"     "generic3"     "spde"         "spde2"       
## [31] "spde3"        "iid1d"        "iid2d"        "iid3d"        "iid4d"       
## [36] "iid5d"        "iidkd"        "2diid"        "z"            "rw2d"        
## [41] "rw2diid"      "slm"          "matern2d"     "dmatern"      "copy"        
## [46] "scopy"        "clinear"      "sigm"         "revsigm"      "log1exp"     
## [51] "logdist"

We also used family = "poisson" above, but there are many other options for the distribution of the data (likelihood):

names(inla.models()$likelihood)
##  [1] "fl"                            "poisson"                      
##  [3] "nzpoisson"                     "xpoisson"                     
##  [5] "cenpoisson"                    "cenpoisson2"                  
##  [7] "gpoisson"                      "poisson.special1"             
##  [9] "0poisson"                      "0poissonS"                    
## [11] "bell"                          "0binomial"                    
## [13] "0binomialS"                    "binomial"                     
## [15] "xbinomial"                     "pom"                          
## [17] "bgev"                          "gamma"                        
## [19] "gammasurv"                     "gammajw"                      
## [21] "gammajwsurv"                   "gammacount"                   
## [23] "qkumar"                        "qloglogistic"                 
## [25] "qloglogisticsurv"              "beta"                         
## [27] "betabinomial"                  "betabinomialna"               
## [29] "cbinomial"                     "nbinomial"                    
## [31] "nbinomial2"                    "cennbinomial2"                
## [33] "simplex"                       "gaussian"                     
## [35] "stdgaussian"                   "gaussianjw"                   
## [37] "agaussian"                     "ggaussian"                    
## [39] "ggaussianS"                    "rcpoisson"                    
## [41] "tpoisson"                      "circularnormal"               
## [43] "wrappedcauchy"                 "iidgamma"                     
## [45] "iidlogitbeta"                  "loggammafrailty"              
## [47] "logistic"                      "sn"                           
## [49] "gev"                           "lognormal"                    
## [51] "lognormalsurv"                 "exponential"                  
## [53] "exponentialsurv"               "coxph"                        
## [55] "weibull"                       "weibullsurv"                  
## [57] "loglogistic"                   "loglogisticsurv"              
## [59] "stochvol"                      "stochvolsn"                   
## [61] "stochvolt"                     "stochvolnig"                  
## [63] "zeroinflatedpoisson0"          "zeroinflatedpoisson1"         
## [65] "zeroinflatedpoisson2"          "zeroinflatedcenpoisson0"      
## [67] "zeroinflatedcenpoisson1"       "zeroinflatedbetabinomial0"    
## [69] "zeroinflatedbetabinomial1"     "zeroinflatedbinomial0"        
## [71] "zeroinflatedbinomial1"         "zeroinflatedbinomial2"        
## [73] "zeroninflatedbinomial2"        "zeroninflatedbinomial3"       
## [75] "zeroinflatedbetabinomial2"     "zeroinflatednbinomial0"       
## [77] "zeroinflatednbinomial1"        "zeroinflatednbinomial1strata2"
## [79] "zeroinflatednbinomial1strata3" "zeroinflatednbinomial2"       
## [81] "t"                             "tstrata"                      
## [83] "nmix"                          "nmixnb"                       
## [85] "gp"                            "dgp"                          
## [87] "logperiodogram"                "tweedie"                      
## [89] "fmri"                          "fmrisurv"                     
## [91] "gompertz"                      "gompertzsurv"

Now we will plot the ZIP code-specific relative risks of hospitalization on a map.

# Plot the ZIP-specific relative risks of hospitalization
zipResiduals <- data.frame(zip = zips, resid = 0)

# Extract each ZIP's combined effect. In INLA's "bym" parameterization, the first
# length(zips) entries of summary.random$zipID are already the sum of the
# structured and unstructured components (the last length(zips) entries hold the
# structured component alone).
zipResiduals$resid <- spatialModel$summary.random$zipID$mean[1:length(zips)]

# Exponentiate the coefficient
zipResiduals$resid <- exp(zipResiduals$resid)

# Draw the map
plotData <- broom::tidy(NYC_SHP)
## Regions defined for each Polygons
zipResiduals$ID <- unique(plotData$id)
plotData <- merge(plotData, zipResiduals, by.x = "id", by.y = "ID")
plotData <- plotData[- which(plotData$zip == "11040"), ] # Remove outlying ZIP from plot
ggplot(plotData, aes(x = long, y = lat, text = zip)) +
  geom_polygon(aes(group = group, fill = resid), color = "black", size = 0.2) +
  scale_fill_distiller(palette = "Purples", direction = 1) +
  ggtitle("ZIP-specific multiplicative factors on city-wide hospitalization rate") +
  theme_void() + theme(plot.title = element_text(hjust = 0.5))

Modeling spatiotemporal dynamics

# Reshape data into long format
modelData <- reshape2::melt(t(NYC_COVID))
colnames(modelData) <- c("Date", "ZIP", "count")
modelData$zipID <- rep(1:length(zips), each = length(dates))

# Add two time counters: one will be for random walk term, the other will be for iid error term
modelData$time <- modelData$time2 <- rep(1:length(dates), length(zips))

# Add population data
modelData$population <- rep(NYC_Populations$pop, times = rep(length(dates), length(zips)))

# Create list of each ZIP’s neighbors (this writes a file into the working directory)
nb2INLA("NYC.graph", poly2nb(NYC_SHP))

# Define the formula for this model
model1Formula <- count ~ 1 + f(zipID, model = "bym", graph = "NYC.graph") + f(time, model = "rw1") + f(time2, model = "iid")

# Fit the model using the INLA algorithm
spatiotemporalModel <- inla(model1Formula, family = "poisson", offset = log(population), data = modelData)

# Look at exponentiated coefficients to get city-wide daily rate
exp(spatiotemporalModel$summary.fixed)
##                     mean       sd  0.025quant     0.5quant   0.975quant
## (Intercept) 0.0003778969 1.030167 0.000356414 0.0003779098 0.0004005968
##                     mode kld
## (Intercept) 0.0003779098   1

# To get a particular ZIP code's fitted values over time, we need to add up the
# two spatial components (structured and unstructured; ZIP-specific) and the two temporal 
# components. We will do this for the four ZIP codes whose local Moran's I statistics
# we looked at previously.
outbreakZips <- c("10002", "11203", "11368", "10451")
zipIDs <- which(zips %in% outbreakZips)
zipFittedValues <- matrix(0, nrow = length(dates), ncol = length(outbreakZips), dimnames = list(as.character(dates), outbreakZips))
for(i in 1:length(outbreakZips)) {
  spatialComponent <- spatiotemporalModel$summary.random$zipID$mean[zipIDs[i]]
  temporalComponent1 <- spatiotemporalModel$summary.random$time$mean
  temporalComponent2 <- spatiotemporalModel$summary.random$time2$mean
  zipFittedValues[, i] <- exp(spatialComponent + temporalComponent1 + temporalComponent2)
}

# Plot the fitted hospitalization rates over time. Notice that they are vertically shifted
# or dampened versions of the same mean pattern, which is what we'd expect from the
# additive nature of how they're constructed (mean time series + spatial terms)
plotData <- reshape2::melt(zipFittedValues)
colnames(plotData) <- c("Date", "ZIP", "Rate")
plotData$ZIP <- as.factor(plotData$ZIP)
ggplot(plotData, aes(x = Date, y = Rate, color = ZIP)) +
  geom_line(aes(group = ZIP)) + scale_color_manual(values = c("red", "forestgreen", "blue", "orange")) +
  theme_bw() + labs(title = "Estimated hospitalization rates in outbreak ZIP codes ") +
  theme(plot.title=element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "bottom")

Forecasting

Forecasting future observations in time can be viewed as fitting the same INLA model with some missing data in the response and computing the predictive distribution of those points. Below, we add missing values for the next week of the COVID dataset and re-fit the spatiotemporal model.

# Create new dataset with NA values for 7 days for all ZIP codes
new_data_matrix <- matrix(NA, nrow = nrow(NYC_COVID), ncol = 7)
rownames(new_data_matrix) <- rownames(NYC_COVID)
colnames(new_data_matrix) <- as.character(seq(from = dates[length(dates)] + 1, to = dates[length(dates)] + ncol(new_data_matrix), by = 1))
NYC_COVID_new <- cbind(NYC_COVID, new_data_matrix)

# Reshape data into long format, and add time counters and population data
modelData <- reshape2::melt(t(cbind(NYC_COVID_new)))
colnames(modelData) <- c("Date", "ZIP", "count")
modelData$zipID <- sort(rep(1:length(zips), ncol(NYC_COVID_new)))
modelData$time <- modelData$time2 <- rep(1:ncol(NYC_COVID_new), length(zips))
modelData$population <- rep(NYC_Populations$pop, times = rep(ncol(NYC_COVID_new), length(zips)))

# Define the formula for this model and fit using the INLA algorithm
model1Formula <- count ~ 1 + f(zipID, model = "bym", graph = "NYC.graph") + f(time, model = "rw1") + f(time2, model = "iid")
spatiotemporalModel <- inla(model1Formula, family = "poisson", offset = log(population), data = modelData, 
                            control.predictor = list(compute = TRUE, link = 1))